


Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff

Neural Information Processing Systems

Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression \citep{simchi2020bypassing}, we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon H (also known as a CMDP with H layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class \mathcal{M} containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm that requires only O(H \log T) calls to an offline density estimation algorithm (or oracle) across all T rounds. This number can be further reduced to O(H \log \log T) if T is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class.
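A standard way to obtain an O(log T) factor in oracle-call counts of this kind is an epoch schedule whose lengths double, with the oracle re-invoked once per layer at each epoch boundary. The sketch below is illustrative only: it counts oracle calls under such a doubling schedule (an assumption on our part, not a description of the paper's actual algorithm) to show why the total is O(H \log T).

```python
def oracle_calls_doubling_schedule(T, H):
    """Count offline-oracle calls over T rounds with horizon H, assuming
    epochs of doubling length (starting at round 1, then 2, 4, 8, ...)
    and one density-estimation oracle call per layer at each epoch start.
    This yields H * (floor(log2(T)) + 1) = O(H log T) calls in total."""
    calls = 0
    t = 1  # round at which the current epoch begins
    while t <= T:
        calls += H  # one oracle call per layer at the epoch boundary
        t *= 2      # next epoch is twice as long
    return calls

# For T = 1024 rounds and horizon H = 5: 11 epoch boundaries, 55 calls.
print(oracle_calls_doubling_schedule(1024, 5))
```

The O(H \log \log T) bound for known T would require a sparser (more aggressive) epoch schedule than plain doubling; the counting logic is analogous.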


Reviews: PAC Reinforcement Learning with Rich Observations

Neural Information Processing Systems

Contextual MDPs are a specific type of POMDP with the restriction that the optimal Q-function depends only on the most recent observation (rather than on the belief state). The authors show that Contextual MDPs are not PAC-learnable in a polynomial number of episodes when either memoryless policies alone are considered or value function approximation alone is used. However, when both memoryless policies and value function approximation are used and the transitions are deterministic, the model is PAC-learnable in a polynomial number of episodes, with complexity independent of the number of observations. The paper is well written overall, and the proofs are clear and thorough. I am not sure that the 16 pages of technical proofs in the appendix are suitable for a conference; the paper may be a better fit for a journal format.